Abstract

I describe the effects that certain factors play over

Introduction

This data set contains information about loans. It has over 113,000 observations and 81 variables, which include data about the loan, the borrower, lenders, and investors. Studying this data should help in understanding the factors that affect loan agreements. For my personal benefit, I hope to better understand the factors that could help me obtain a comfortable mortgage and pay it off, so that I can finally own a house after wishing for one for years.

Slight Adjustments to the Data Set

I started by removing ambiguous employment statuses from the data set. I also removed outliers, which made the charts very difficult to read. In addition, I created a data frame with means and medians to make this information easier to compare.
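As a rough sketch of these three steps, the R code below works on a small invented data frame rather than the real loan data. The list of "ambiguous" statuses, the income cutoff, and the `LoanOriginalAmount` column name are assumptions made for illustration; the report does not state the exact rules that were used.

```r
# Hypothetical miniature of the loan data; all values are invented.
loans_demo <- data.frame(
  EmploymentStatus = c("Employed", "Other", "Self-employed", "Not available",
                       "Employed", "Part-time", "Employed", "Not employed"),
  StatedMonthlyIncome = c(4200, 3100, 6500, 2800, 5100, 1500, 250000, 0),
  LoanOriginalAmount = c(8000, 4000, 15000, 3000, 10000, 2500, 12000, 2000)
)

# 1. Drop statuses that do not say how the borrower is employed (assumed list).
ambiguous <- c("Other", "Not available")
loans_clean <- loans_demo[!loans_demo$EmploymentStatus %in% ambiguous, ]

# 2. Trim extreme incomes with an assumed cutoff.
income_cap <- 20000
loans_clean <- loans_clean[loans_clean$StatedMonthlyIncome <= income_cap, ]

# 3. Mean and median loan amount per employment status.
loan_summary <- aggregate(
  LoanOriginalAmount ~ EmploymentStatus,
  data = loans_clean,
  FUN = function(x) c(mean = mean(x), median = median(x))
)
print(loan_summary)
```

Keeping the summary in its own data frame leaves the cleaned rows untouched, so the same cleaned data can feed both the raw plots and the mean/median comparisons.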

Univariate Analysis

The most common employment status is “Employed”, while the least frequent is “Not Employed”.

This chart shows that the majority of loans come from the $25,000-74,999 income ranges, while only a small number come from the “Not Employed” range.

The next plot compares the number of homeowners obtaining loans against non-homeowners. Although the counts are very similar, there are more homeowners than non-homeowners.

The majority of loan amounts fall between $0 and $10,000.

This chart shows the number of loans that have defaulted, arranged by loan amount. $5,000 loans are the most commonly defaulted.

The next chart shows a similar pattern to the three mentioned previously.

Bivariate Analysis

This chart shows the number of loans defaulted by income range. Again we see $25,000-49,999 as the most frequent, followed by the $50,000-74,999 range, although with lower frequency.

This chart looks like the previous one but with a much smaller range on the y-axis; it shows the number of loans completed by income range.

In this boxplot, “Employed” borrowers have the highest median loan amount, followed by “Self-Employed” and “Full-time”, while “Not Employed” and “Part-time” have the lowest.

This other boxplot shows a possible positive relationship between income range and loan amount. Higher incomes correspond to higher loan amounts, with the “$100,000+” income range showing the highest median loan amount and “$1-24,999” the lowest.

Comparing homeowners by monthly income, this bar graph shows that homeowners have higher monthly incomes.

In this scatterplot, the highest density of dots falls between $0 and $10,000 of monthly income and $0 and $10,000 of loan amount.

Comparing credit scores with loan amounts, there is another possible relationship between these two variables up to credit scores of about 800, above which loan amounts start to decrease.

Borrower rates decrease with higher credit scores.

Borrower APR behaves very similarly to the borrower rate; in fact, the APR is built from the rate plus loan fees.

This chart represents the strongest relationship between two variables in my analysis. Lender yield is strictly dependent on the borrower’s rate: the more the borrower pays, the higher the lender’s yield. A Pearson correlation test between borrower APR and lender yield gives a coefficient of about 0.991, which is a very strong relationship.

## 
##  Pearson's product-moment correlation
## 
## data:  loans2$BorrowerAPR and loans2$LenderYield
## t = 2322, df = 97711, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9909477 0.9911709
## sample estimates:
##     cor 
## 0.99106
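For readers who want to reproduce this kind of output, here is a minimal sketch of a Pearson test with `cor.test()` from base R. It runs on synthetic data, not on `loans2`: the fee added to the APR and the fee subtracted for the lender yield are invented assumptions, chosen only so that the two columns move together the way the text describes.

```r
# Synthetic stand-in for the loan data; column names mirror the report,
# but the values and both fee assumptions are made up for illustration.
set.seed(42)
n <- 1000
borrower_rate <- runif(n, min = 0.05, max = 0.35)

loans_demo <- data.frame(
  # rate plus an assumed, slightly noisy fee component
  BorrowerAPR = borrower_rate + rnorm(n, mean = 0.02, sd = 0.005),
  # rate minus an assumed flat servicing fee
  LenderYield = borrower_rate - 0.01
)

# cor.test() uses the Pearson method unless told otherwise.
result <- cor.test(loans_demo$BorrowerAPR, loans_demo$LenderYield)
r_demo <- unname(result$estimate)   # correlation coefficient
df_demo <- unname(result$parameter) # degrees of freedom, n - 2
print(result)
```

Because both columns are derived from the same rate, the coefficient on this demo data should also come out close to 1.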

However, losses also increase with higher APRs. Since there is a relationship between borrower APR and estimated losses, lender yields are also related to estimated losses.

The lender’s median yield decreases with higher “Prosper Scores”. A Prosper Score rates the likelihood that a borrower will pay back a loan; higher scores represent borrowers with a better probability of paying back.

The relationship between employment status and median monthly income does not seem to indicate anything significant, other than a lot of noise with more tenure.

The more open credit lines a borrower has, the higher the mean income that can be expected from the individual. A Pearson correlation test, however, shows only a weak relationship, with a coefficient of about 0.28.

## 
##  Pearson's product-moment correlation
## 
## data:  loans2$OpenCreditLines and loans2$StatedMonthlyIncome
## t = 91.229, df = 97711, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2743744 0.2859303
## sample estimates:
##       cor 
## 0.2801625

In this chart, we can almost observe an inverted parabola: mean monthly income initially increases with more inquiries in the last 6 months, but then decreases after about 6 inquiries.

A line chart may not be the best way to represent these two variables, but it shows an initial decrease in mean monthly income, followed by an increase after 2 delinquencies.

Delinquencies seem to increase with the number of recent inquiries the borrower has.

Borrowers seem to have fewer delinquencies with more credit lines. Borrowers with almost 11 credit lines have close to 0 delinquencies.

Mean loan amounts have a negative relationship with delinquencies: the lower the amount, the more delinquencies the borrower has.

Multivariate Analysis

The following chart uses a boxplot to compare mean loan amounts between homeowners and non-homeowners, grouped by income range. The highest loans come from incomes of $75,000 to $100,000+. The next chart, in similar fashion, compares median loan amounts between homeowners and non-homeowners, grouped by employment status; “Employed” and “Self-employed” have the highest medians.

This chart shows a possible positive relationship between loan amounts and loan terms, with 12-month terms representing the lowest amounts and 60-month terms representing the highest. As previously hinted, the next chart also shows a positive relationship between loan amounts and income: the more income, the higher the loan amount.

These two bar graphs compare debt-to-income ratio by income range and by whether the borrower is a homeowner. It seems that homeowners with incomes between $1-24,999 have the highest debt-to-income ratios, and this ratio decreases abruptly afterwards.

This scatterplot supports the idea that more income corresponds to higher loan amounts. Yellow dots represent the highest incomes, and there is a good concentration of high-amount loans in the yellow section.

This chart expands a bit on the previous one by comparing monthly income and loan amounts by loan term.

Previously we saw a chart showing a negative relationship between Prosper Scores and lender yields. This chart supplements that idea; however, it shows a little oddity for income ranges between $1-24,999, where there is a bit of a spike at Prosper Score 6.

Since there is a strong relationship between lender yield and borrower APR, this chart looks very similar to the previous one, but with a different y-axis.

Final Plots and Summary

The most interesting observations I found in this data were not the relationships between variables, but the lack of a relationship where I was expecting one. One thing I was curious about was which credit scores were defaulting the most. Before plotting the chart, I pictured lower credit scores defaulting the most. However, since most of the loans are given to borrowers with scores in the vicinity of 700, these scores were also the ones with the most defaulted loans, likely because of the share of borrowers they represent.
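The difference between counting defaults and measuring how often a group defaults can be illustrated with a small sketch. The counts below are invented and are not taken from the loan data; they only show how the largest group can hold the most defaults while a smaller group has the higher default rate.

```r
# Invented counts per credit score bin; not taken from the report's data.
score_demo <- data.frame(
  CreditScoreBin = c("600-649", "650-699", "700-749", "750-799"),
  Loans    = c(1000, 3000, 4000, 2000),
  Defaults = c(120, 270, 280, 80)
)

# Share of all defaults that fall in each bin (what a count chart shows).
score_demo$ShareOfDefaults <- score_demo$Defaults / sum(score_demo$Defaults)

# Default rate within each bin (defaults relative to loans issued in that bin).
score_demo$DefaultRate <- score_demo$Defaults / score_demo$Loans

print(score_demo)

most_defaults <- score_demo$CreditScoreBin[which.max(score_demo$ShareOfDefaults)]
highest_rate  <- score_demo$CreditScoreBin[which.max(score_demo$DefaultRate)]
```

In this made-up example the 700-749 bin holds the most defaults simply because it holds the most loans, while the 600-649 bin has the highest rate. Whether the real data behaves the same way would need to be checked against the actual counts.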

Default Distribution by Credit Score

Income by Credit Score Where Loans Defaulted

## 
##  Pearson's product-moment correlation
## 
## data:  loans2$CreditScoreRangeLower and loans2$StatedMonthlyIncome
## t = 63.256, df = 97711, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1923109 0.2043578
## sample estimates:
##       cor 
## 0.1983418

Income by Credit Score Where Loans Defaulted

One thing I was very confident about was that people with higher bankcard utilization would be more likely to fail to pay on time. But my assumption was wrong again. In fact, this is one of the weakest relationships I explored, with a coefficient of almost 0.

## 
##  Pearson's product-moment correlation
## 
## data:  loans2$BankcardUtilization and loans2$CurrentDelinquencies
## t = -21.313, df = 97711, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.07426319 -0.06178106
## sample estimates:
##         cor 
## -0.06802478
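The very small p-value and the near-zero coefficient can look contradictory at first. A short calculation, using only the numbers printed in the output above, shows how both can hold with this many observations; the variable names are just illustrative.

```r
# Values copied from the test output above.
r  <- -0.06802478
df <- 97711

# Fraction of variance that a straight-line fit would account for.
shared_variance <- r^2

# t statistic implied by r and the degrees of freedom.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

print(shared_variance)
print(t_stat)
```

The squared coefficient is well under 1%, so the relationship is weak in practical terms. The t statistic is still large in magnitude because it scales with the square root of the sample size, which is why the test reports a tiny p-value.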

Reflection

When I moved to this country, I had to build a credit score, since I had none. It was a struggle, because no financial institution would even let me open a credit card with them, and how are you supposed to build a credit score if no one gives you the opportunity to build credit? Well, you start with a secured credit card, which essentially means paying the bank fees and interest to borrow your own money, but these payments are reported to credit bureaus and that’s how you start building credit history. I knew that I would give my 100% to pay the bank back, but my telling them so did not mean anything, because they really didn’t know anything about me. Data speaks for itself.

Banks have performed this type of analysis thousands and thousands of times, likely in much more depth than this, so when they ask you about your credit score, your current debt, your income and more, it is for a reason. Although you may think these factors do not apply to you because you know you will pay back, the bank has no way of measuring your ability to pay just by listening to you say so. In a broader sense, these factors are accurate and help minimize losses for both borrowers and lenders.

From a technical perspective, most of the challenges I faced while building these plots came from understanding or finding a plot that depicts the data in a way that makes sense. I had the variables and knew what I was looking for. However, building the plots was hard: I ended up with line graphs with way too much noise, irrational bar graphs, or scatter plots with lines of dots that made no sense.

To further improve this project, I would like to add pie charts. Currently I can build simple ones, but I was not able to build one with the data from this set, as I lacked the knowledge to do so, even after hours of researching online how to do this. I would also like to build better line graphs that actually have a continuous x-axis variable.
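One possible starting point: a pie chart needs a single number per slice, so a raw categorical column usually has to be counted first. The sketch below uses base R on an invented vector of employment statuses; with the real data, the vector would be replaced by the corresponding column, which is an assumption about how the data is stored.

```r
# Invented employment statuses; substitute the real column when available.
status_demo <- c("Employed", "Employed", "Self-employed", "Employed",
                 "Part-time", "Not employed", "Self-employed", "Employed")

# table() collapses the raw values into one count per category.
status_counts <- table(status_demo)

# Percentage share of each category, used in the slice labels.
status_share <- round(100 * prop.table(status_counts), 1)
slice_labels <- paste0(names(status_counts), " (", status_share, "%)")

# pie() from the base graphics package draws one slice per count.
pie(status_counts, labels = slice_labels,
    main = "Loans by employment status (demo data)")
```

With many categories or very small slices a pie chart becomes hard to read, so grouping rare statuses into a single slice before plotting may help.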